Data Cleaning with Python

Data quality aspects

Data validity constraints

Data cleaning workflow

1 Inspection 

2 Explore cleaning

3 Defining & verifying pipeline

4 Implementing & reporting
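
The four workflow steps can be sketched as a small pandas pipeline. This is a minimal illustration, not the course's implementation: the helper names (drop_exact_duplicates, keep_plausible_ages, clean, verify), the toy data, and the age bounds 18 to 100 are all assumptions made for the example.

```python
import pandas as pd


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Cleaning step: remove fully identical rows."""
    return df.drop_duplicates()


def keep_plausible_ages(df: pd.DataFrame, low: int = 18, high: int = 100) -> pd.DataFrame:
    """Cleaning step: keep rows whose age lies in an assumed plausible range."""
    return df[df["age"].between(low, high)]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Defined pipeline: chain the individual cleaning steps."""
    return (
        df.pipe(drop_exact_duplicates)
          .pipe(keep_plausible_ages)
          .reset_index(drop=True)
    )


def verify(df: pd.DataFrame) -> None:
    """Verification: fail loudly if a validity constraint is still violated."""
    assert not df.duplicated().any(), "duplicates remain"
    assert df["age"].between(18, 100).all(), "implausible ages remain"


# Toy data invented for illustration.
raw = pd.DataFrame({
    "age": [34, 34, 230, 51],
    "job": ["admin.", "admin.", "services", "technician"],
})

cleaned = clean(raw)
verify(cleaned)

# Reporting: record how many rows the pipeline removed.
report = {"rows_in": len(raw), "rows_out": len(cleaned)}
print(report)
```

Keeping each step as its own function makes it easy to explore steps individually first and to re-run the verified pipeline unchanged on new data.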

Prerequisites

Data

The original data source is the Bank Marketing Data Set from the UCI Machine Learning Repository. It contains bank client attributes and other variables collected during the marketing campaigns of a Portuguese banking institution. Variable definitions are available at the source.

Here, an abridged and adjusted version provided on Ilias is used for training purposes.

Working on Google Colab

Working on local machine

Data inspection

Checking variable validity

First view on data
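
A first view usually combines the shape, the first rows, the column types, and basic summary statistics. The sketch below uses a toy DataFrame invented for illustration; with the real file, df would instead come from pd.read_csv on the downloaded data, and the column names may differ.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "job": ["admin.", "services", "technician", "admin."],
    "y": ["no", "yes", "no", "no"],
})

print(df.shape)                    # number of rows and columns
print(df.head())                   # first rows
df.info()                          # column dtypes and non-null counts
print(df.describe(include="all"))  # summary statistics for all columns
```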

Validity: duplicates, datatypes, ranges, set-memberships, uniques, unknown/missing values
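
Each of the validity checks listed above maps onto a short pandas expression. The following is a sketch under stated assumptions: the toy data, the age bounds, the allowed set of marital levels, and the use of the literal string "unknown" as a placeholder for missing categories are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; real column names may differ.
df = pd.DataFrame({
    "age": [34, 51, 51, 230, 28],
    "marital": ["married", "single", "single", "divorced", "Single"],
    "job": ["admin.", "unknown", "unknown", "technician", "services"],
    "duration": [120.0, 300.0, 300.0, np.nan, 45.0],
})

# Duplicates: count rows that repeat an earlier row exactly.
n_duplicates = int(df.duplicated().sum())

# Datatypes: compare against what each variable should be.
print(df.dtypes)

# Ranges: rows with an age outside an assumed plausible interval.
age_out_of_range = df[~df["age"].between(18, 100)]

# Set membership: rows whose level is not in the assumed allowed set.
allowed_marital = {"married", "single", "divorced"}
bad_marital = df[~df["marital"].isin(allowed_marital)]

# Uniques: list distinct levels to spot spelling variants.
marital_levels = sorted(df["marital"].unique())

# Missing values: NaN counts per column.
n_missing = df.isna().sum()

# Unknown values: count the placeholder string in the categorical columns.
n_unknown = (df[["marital", "job"]] == "unknown").sum()

print(n_duplicates, len(age_out_of_range), len(bad_marital))
print(marital_levels)
```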

Data inspection with summaries

Bivariate
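
Bivariate summaries can be sketched with a contingency table for two categorical variables and grouped statistics for a numeric variable against a categorical one. The toy data and the column names (marital, y, age) are assumptions for illustration.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "marital": ["married", "single", "single", "married", "divorced", "single"],
    "y": ["no", "yes", "no", "no", "yes", "yes"],
    "age": [45, 29, 33, 52, 41, 26],
})

# Categorical vs. categorical: contingency table with row-wise shares.
share_by_marital = pd.crosstab(df["marital"], df["y"], normalize="index")

# Numeric vs. categorical: summary statistics per group.
age_by_outcome = df.groupby("y")["age"].agg(["count", "mean", "median"])

print(share_by_marital)
print(age_by_outcome)
```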

Data inspection by visualizations

Categorical

Numeric

Bi-/Trivariate
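
One possible way to cover the three kinds of plots is a bar chart for a categorical variable, a histogram for a numeric one, and a grouped boxplot for a bivariate view. The sketch below uses plain matplotlib on toy data invented for illustration; the plot choices and column names are assumptions, not the course's prescribed figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "marital": ["married", "single", "single", "married", "divorced", "single"],
    "y": ["no", "yes", "no", "no", "yes", "yes"],
    "age": [45, 29, 33, 52, 41, 26],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Categorical: bar chart of level frequencies.
counts = df["marital"].value_counts()
axes[0].bar(list(counts.index), counts.to_numpy())
axes[0].set_title("marital (counts)")

# Numeric: histogram of the distribution.
axes[1].hist(df["age"], bins=5)
axes[1].set_title("age (histogram)")

# Bivariate: numeric variable split by a categorical one.
grouped = {name: grp["age"].to_numpy() for name, grp in df.groupby("y")}
labels = list(grouped)
axes[2].boxplot(list(grouped.values()))
axes[2].set_xticks(range(1, len(labels) + 1))
axes[2].set_xticklabels(labels)
axes[2].set_title("age by y")

fig.tight_layout()
```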

Data inspection with pandas-profiling

Explore cleaning

Dealing with incorrect data

Removing data
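
Removing data typically means dropping duplicate rows, rows that violate a range constraint, or rows with missing values in a required column. A minimal sketch, assuming toy data and age bounds of 18 to 100 chosen only for illustration:

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "age": [34, 51, 51, 230, 28],
    "job": ["admin.", "services", "services", "technician", None],
})

cleaned = (
    df.drop_duplicates()                           # remove exact duplicate rows
      .loc[lambda d: d["age"].between(18, 100)]    # remove implausible ages
      .dropna(subset=["job"])                      # remove rows without a job
      .reset_index(drop=True)
)

print(cleaned)
```

Removal is the simplest option but shrinks the sample, so it is worth checking how many rows each step discards before settling on it.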

Correcting data

Unify inconsistent, but equal values OR aggregate variable levels
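
Both operations can be sketched with pandas string methods and replace. The spelling variants and the decision to merge "divorced" and "widowed" into one level are hypothetical choices made for this example.

```python
import pandas as pd

# Toy data invented for illustration.
marital = pd.Series(
    ["Married", "married ", "single", "SINGLE", "divorced", "widowed"],
    name="marital",
)

# Unify inconsistent but equal values: trim whitespace, normalize case.
unified = marital.str.strip().str.lower()

# Aggregate variable levels: merge two levels into one (hypothetical choice).
aggregated = unified.replace({
    "divorced": "previously married",
    "widowed": "previously married",
})

print(sorted(unified.unique()))
print(sorted(aggregated.unique()))
```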

Imputing data

Imputing can conveniently be done with the imputers available in scikit-learn's sklearn.impute module.

Of course, imputing can be done manually with Pandas, too.
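
The two routes can be compared side by side. This sketch assumes median imputation for numeric columns and most-frequent imputation for a categorical column; the strategies, the toy data, and the column names are illustrative choices, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for illustration.
df = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 40.0],
    "duration": [100.0, 200.0, np.nan, 600.0],
    "job": ["admin.", "services", None, "admin."],
})

# scikit-learn: fit the imputer on the numeric columns, then transform them.
numeric_cols = ["age", "duration"]
imputer = SimpleImputer(strategy="median")
df_sk = df.copy()
df_sk[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# pandas: the same idea done manually, plus the mode for a categorical column.
df_pd = df.copy()
df_pd[numeric_cols] = df_pd[numeric_cols].fillna(df_pd[numeric_cols].median())
df_pd["job"] = df_pd["job"].fillna(df_pd["job"].mode().iloc[0])

print(df_sk)
print(df_pd)
```

An advantage of the scikit-learn route is that the fitted imputer stores the learned statistics and can apply them unchanged to new data.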

Transforming data

Recode nominals into dummy variables
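
A minimal sketch with pandas.get_dummies, using toy data invented for illustration. Dropping the first level (drop_first=True) is an assumption made here to avoid a redundant column; whether that is wanted depends on the model used afterwards.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "divorced"],
})

# One 0/1 column per level; the first level (alphabetically) is dropped.
dummies = pd.get_dummies(df, columns=["marital"], drop_first=True, dtype=int)

print(dummies)
```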

Recode ordinals as numeric
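
Ordinal levels need an explicit order before they are turned into numbers. The sketch below uses an ordered pandas Categorical; the level names and their order are assumptions for illustration.

```python
import pandas as pd

# Assumed level order, lowest to highest.
education_order = ["primary", "secondary", "tertiary"]

# Toy data invented for illustration.
education = pd.Series(["secondary", "tertiary", "primary", "secondary"])

# Ordered categorical: the integer codes follow the declared order.
as_ordered = pd.Categorical(education, categories=education_order, ordered=True)
education_num = pd.Series(as_ordered.codes, index=education.index)

print(education_num.tolist())
```

Values that are not in the declared categories receive the code -1, so it is worth checking for that value after recoding.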

Standardize
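
Standardizing rescales each numeric column to mean 0 and standard deviation 1. The sketch below computes z-scores manually with pandas on toy data invented for illustration; it uses the population standard deviation (ddof=0), which is also what scikit-learn's StandardScaler uses.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "duration": [100.0, 200.0, 600.0],
})

# z-score: subtract the column mean, divide by the column standard deviation.
z = (df - df.mean()) / df.std(ddof=0)

print(z)
```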

Save preprocessed data
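
Saving can be as simple as writing a CSV file and reading it back as a check. The file name and the temporary directory used here are hypothetical; in practice the path would point to the project or Colab working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy preprocessed data invented for illustration.
df = pd.DataFrame({"age": [30, 45], "marital_single": [0, 1]})

# Hypothetical output location.
out_path = Path(tempfile.gettempdir()) / "bank_preprocessed.csv"

# index=False avoids writing the row index as an extra column.
df.to_csv(out_path, index=False)

# Read the file back to confirm that nothing was lost.
restored = pd.read_csv(out_path)
print(restored.equals(df))
```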